Abstract: We study high-dimensional asymptotic performance
limits of binary supervised classification problems where the class 
conditional densities are Gaussian with unknown means and 
covariances and the number of signal dimensions scales faster 
than the number of labeled training samples. We show that the 
Bayes error, namely the minimum attainable error probability 
with complete distributional knowledge and equally likely classes, 
can be arbitrarily close to zero and yet the limiting minimax error 
probability of every supervised learning algorithm is no better 
than a random coin toss. In contrast to related studies where 
the classification difficulty (Bayes error) is made to vanish, we 
hold it constant when taking high-dimensional limits. We also 
show that a nontrivial asymptotic minimax error probability can 
only be attained for parametric subsets of zero measure (in a 
suitable measure space). These results expose the fundamental 
importance of prior knowledge and suggest that unless we impose 
strong structural constraints, such as sparsity, on the parametric 
space, supervised learning may be ineffective in high dimensional 
small sample settings. 

I. Introduction 

In a number of applications ranging from medical imaging 
to economics, one encounters inference problems that suffer 
from the "curse of dimensionality", namely the situation 
where the observed signals are high-dimensional and we lack 
sufficient labeled training samples from which to accurately 
learn models and make reliable decisions. This may be true 
even when the underlying decision problem becomes "easy" 
with perfect knowledge of models or latent variables. For 
example, in detecting the presence or absence of stroke, high- 
dimensional tomographic X-ray projections are measured, 
though a stroke may affect only tissue properties in a localized 
spatial region and may be easily detectable if one knew where 
to look. Labeled training samples are typically limited in this 
context due to the high cost of engaging domain experts. 
Research in recent years has therefore focused on leveraging 
prior knowledge in the form of sparsity or other latent low- 
dimensional structure to improve decision making. 

Our aim in this work is to expose certain fundamental limi- 
tations of supervised learning in high dimensional small sam- 
ple settings and highlight the fundamental necessity of strong 
structural constraints (prior knowledge), such as sparsity, for 
attaining nontrivial asymptotic error rates. Towards this end 
we study the high-dimensional asymptotic performance limit 
of binary supervised classification where the class conditional 
densities are Gaussian with unknown means and covariances. 
If d and n respectively denote the number of signal dimensions 
and the number of labeled training samples, our focus is on
the "big d small n" asymptotic regime where $n/d \to 0$ as
$d \to \infty$. In previous related work, either $n/d \to c > 0$ as
$d \to \infty$, or the classification difficulty (Bayes error) converges
to zero as $d \to \infty$ [1], [2], or the focus was on special families
of learning rules such as Naive-Bayes, banded covariance
structure [3], and plug-in rules [4], [5]. In contrast, we hold
the Bayes error constant when taking high-dimensional limits.

We establish two key results in this paper: 1) An im- 
possibility result: When the number of signal dimensions 
scales faster than the number of labeled samples at constant 
classification difficulty, the asymptotic minimax classification 
error probability of any supervised classification algorithm 
cannot converge to anything less than half. 2) Necessity of 
"structure" in parameter set: Nontrivial asymptotic minimax 
error probability is attainable only for parametric subsets of 
zero Haar measure. 

II. Problem Formulation 
Binary classification with Gaussian class conditional
densities: Let $X \in \mathbb{R}^d$ denote the observed signal (data),
$Y \in \{-1,+1\}$ the latent binary class label with
$p_Y(+1) = p_Y(-1) = 1/2$, and
$p_{X|Y,\theta}(x|y,\theta) = \mathcal{N}(\mu_y, \Sigma)(x)$, where
$\mu_{+1}, \mu_{-1} \in \mathbb{R}^d$ are mean vectors for classes
$+1$ and $-1$ respectively, $\Sigma$ is a covariance matrix common
to both classes, and $\theta := (\mu_{+1}, \mu_{-1}, \Sigma)$ denotes the
tuple of all parameters. Thus
$(X,Y)\,|\,\theta \sim p_{X|Y,\theta}(x|y,\theta)\, p_Y(y)$. The
minimum probability of error classification rule, henceforth
referred to as the optimum rule, is the maximum a posteriori
probability (MAP) rule and is given by

$$y_{\mathrm{opt}}(x) = \arg\max_{y \in \{-1,+1\}} p_{Y|X,\theta}(y|x,\theta) = \mathrm{sign}\left(\Delta^T \Sigma^{+} (x - \mu)\right) \tag{1}$$

where $\mu := \frac{1}{2}(\mu_{+1} + \mu_{-1})$, $\Delta := \mu_{+1} - \mu_{-1}$,
and $\Sigma^{+}$ is the pseudoinverse of $\Sigma$. The error probability
of the optimum rule (Bayes error) is given by

$$P_e^{\mathrm{opt}} = Q\left(\tfrac{1}{2}\,\left\|(\Sigma^{+})^{1/2}(\mu_{+1} - \mu_{-1})\right\|\right) =: Q(\alpha/2) \tag{2}$$

where $Q(t) := \int_t^\infty \frac{1}{\sqrt{2\pi}} \exp\{-z^2/2\}\, dz$ and
$\alpha := \|(\Sigma^{+})^{1/2}(\mu_{+1} - \mu_{-1})\|$. The minimum
attainable error probability with complete distributional knowledge
and equally likely classes, i.e., the "classification difficulty", is
thus $Q(\alpha/2)$. We will assume that $\alpha > 0$ so that the
Bayes error is nontrivial, i.e., strictly less than 1/2.
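
The quantities in (1) and (2) are straightforward to evaluate
numerically. The following is a minimal sketch, assuming NumPy is
available; the function names (q_function, difficulty, bayes_error,
optimum_rule) are illustrative choices and do not come from the
paper, and the tie-breaking at a zero score is an arbitrary
assumption.

```python
import math
import numpy as np

def q_function(t):
    """Gaussian tail probability Q(t) = P(N(0, 1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def difficulty(mu_pos, mu_neg, sigma):
    """alpha = ||(Sigma^+)^{1/2} (mu_pos - mu_neg)||, using the pseudoinverse."""
    delta = np.asarray(mu_pos, dtype=float) - np.asarray(mu_neg, dtype=float)
    sigma_pinv = np.linalg.pinv(np.asarray(sigma, dtype=float))
    # delta^T Sigma^+ delta is the squared norm of (Sigma^+)^{1/2} delta.
    return math.sqrt(max(float(delta @ sigma_pinv @ delta), 0.0))

def bayes_error(mu_pos, mu_neg, sigma):
    """Bayes error Q(alpha / 2) from equation (2)."""
    return q_function(0.5 * difficulty(mu_pos, mu_neg, sigma))

def optimum_rule(x, mu_pos, mu_neg, sigma):
    """MAP rule sign(Delta^T Sigma^+ (x - mu)) from equation (1)."""
    mu_pos = np.asarray(mu_pos, dtype=float)
    mu_neg = np.asarray(mu_neg, dtype=float)
    mu = 0.5 * (mu_pos + mu_neg)
    delta = mu_pos - mu_neg
    sigma_pinv = np.linalg.pinv(np.asarray(sigma, dtype=float))
    score = float(delta @ sigma_pinv @ (np.asarray(x, dtype=float) - mu))
    return 1 if score >= 0 else -1  # zero score mapped to +1 (arbitrary choice)

# Example: means +h and -h with identity covariance in d = 4.
h = np.array([1.0, 0.0, 0.0, 0.0])
alpha_example = difficulty(h, -h, np.eye(4))
error_example = bayes_error(h, -h, np.eye(4))
```

In this example the difficulty is 2 and the Bayes error is Q(1),
roughly 0.159, regardless of the ambient dimension.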



III. Main Results 



Supervised classification rules: Let
$T_n := \{(X_i, Y_i),\ i = 1, \ldots, n\}$ be a set of $n$ labeled
training samples that are independent and identically distributed
(iid) according to $(X_i, Y_i) \sim p_{X|Y,\theta}\, p_Y$ where
$\theta$ belongs to a known set of feasible parameters $\Theta$.
Let $X_0$ be a test sample (independent of $T_n$) whose true class
label $Y_0$ is unobservable and needs to be estimated. A supervised
classification rule is a measurable mapping from the data space to
the set of class labels

$$\hat{y}_{T_n} : \mathbb{R}^d \to \{-1,+1\}$$

constructed using an algorithm that has access to the set
of training samples $T_n$ and knowledge of $\Theta$ but no direct
knowledge of $\theta$ itself.

Constant difficulty parameter sets: We are interested in
taking constant-difficulty high-dimensional limits. To this end
we define $\Theta_0(\alpha)$ to be the set of all $\theta$ values for
which the Bayes error is equal to $Q(\alpha/2)$ (see (2)):

$$\Theta_0(\alpha) := \left\{(\mu_{+1}, \mu_{-1}, \Sigma) : \left\|(\Sigma^{+})^{1/2}(\mu_{+1} - \mu_{-1})\right\| = \alpha\right\}.$$

A canonical subset of $\Theta_0$ of particular interest is one in
which the covariance is spherical and the means are constrained
to be on opposite ends of a d-dimensional sphere:

$$\Theta_{\mathrm{Sphere}}(\alpha) := \left\{(h, -h, \beta^2 I) : \|h\| = 1,\ \beta = 2/\alpha\right\}.$$

Clearly, $\Theta_{\mathrm{Sphere}}(\alpha) \subseteq \Theta_0(\alpha)$. This
special parametric set corresponds to the scenario in which $X$ can
be represented as $X = Yh + Z$ where $Z$ is white Gaussian noise
that is independent of $Y$.
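
To make the spherical model concrete, the sketch below draws labeled
samples according to X = Yh + Z with noise standard deviation
2/alpha per coordinate. It is only an illustration: the function
name sample_sphere_model, the use of NumPy's default_rng, and the
example values of n, d and alpha are assumptions made here, not
choices taken from the paper.

```python
import numpy as np

def sample_sphere_model(n, h, alpha, rng):
    """Draw n labeled samples from X = Y h + Z for the spherical set.

    h is normalized to unit length; Z has iid N(0, beta^2) entries
    with beta = 2 / alpha, so the classification difficulty is alpha.
    """
    h = np.asarray(h, dtype=float)
    h = h / np.linalg.norm(h)
    beta = 2.0 / alpha
    y = rng.choice(np.array([-1, 1]), size=n)        # equally likely labels
    z = beta * rng.standard_normal((n, h.size))       # white Gaussian noise
    x = y[:, None] * h[None, :] + z
    return x, y, beta

rng = np.random.default_rng(0)
x_train, y_train, beta = sample_sphere_model(n=5, h=np.ones(8), alpha=4.0, rng=rng)
```

Since the two class means are h and -h, their separation is 2 and the
difficulty 2/beta recovers the requested alpha.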

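Error probabilities: Let

$$P_{e|T_n,\theta}(\hat{y}_{T_n}) := \Pr\left(\hat{y}_{T_n}(X_0) \neq Y_0 \mid T_n, \theta\right)$$

denote the error probability of the classifier $\hat{y}_{T_n}$ for a
given training set $T_n$ and some parameter $\theta$. Let

$$P_{e|\theta}(\hat{y}_{T_n}) := E_{T_n|\theta}\left[P_{e|T_n,\theta}(\hat{y}_{T_n}) \mid \theta\right]$$

denote the expected error probability of the classifier
$\hat{y}_{T_n}$ averaged across training samples $T_n$ for some
parameter $\theta$. Finally let

$$P_e(n, d, \Theta, \hat{y}_{T_n}) := \sup_{\theta \in \Theta} P_{e|\theta}(\hat{y}_{T_n})$$

be the maximum expected error probability of the classifier
$\hat{y}_{T_n}$ over the parameter set $\Theta$, which depends on the
number of labeled training samples $n$ and the number of signal
dimensions $d$.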

Goal: We aim to gain an understanding of how
$P_e(n, d, \Theta, \hat{y}_{T_n})$ behaves in the constant-difficulty
high-dimensional setting where $d, n \to \infty$,
$n/d,\ n/\mathrm{rank}(\Sigma^{+}) \to 0$, and
$\Theta \subseteq \Theta_0(\alpha)$ is a sequence of constant-difficulty
parameter sets.



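Theorem 1. For any sequence of classifiers $\hat{y}_{T_n}$, we have

$$\liminf_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta_{\mathrm{Sphere}}, \hat{y}_{T_n}) \geq \frac{1}{2}.$$

Corollary 1. For any sequence of parameter sets $\Theta$ with
$\Theta_{\mathrm{Sphere}} \subseteq \Theta$ and any sequence of classifiers
$\hat{y}_{T_n}$, we have

$$\liminf_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta, \hat{y}_{T_n}) \geq \frac{1}{2}.$$

Corollary 2. Let
$\Theta_{\mathrm{subset}} := \{(h, -h, \beta^2 I) \in \Theta_{\mathrm{Sphere}},\ h \in \mathcal{H} \subseteq S^{d-1}\}$
where $S^{d-1}$ is the unit $(d-1)$-sphere in $\mathbb{R}^d$. Let
$\mathrm{vol}(\mathcal{H}) = \Pr_{H \sim U(S^{d-1})}(H \in \mathcal{H})$,
where $U(S^{d-1})$ denotes the uniform distribution over $S^{d-1}$.
If for a sequence of classifiers $\hat{y}_{T_n}$,

$$\limsup_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta_{\mathrm{Sphere}}, \hat{y}_{T_n}) = \frac{1}{2}$$

and

$$\lim_{(d,\,n/d) \to (\infty,0)} \mathrm{vol}(\mathcal{H}) > 0,$$

then

$$\limsup_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta_{\mathrm{subset}}, \hat{y}_{T_n}) \geq \frac{1}{2}.$$

The proofs of the theorem and the two corollaries are
presented in Section V.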

IV. Discussion 

Theorem 1 informs us that even in the "easier" scenario
where the covariance is spherical and perfectly known, the
worst-case classification performance of any sequence of
supervised classifiers is asymptotically no better than that of
the classifier which flips an unbiased coin to make decisions. It is
interesting to note that this conclusion holds for any arbitrarily
small positive Bayes error $Q(\alpha/2)$.

An immediate corollary of this theorem is the impossibility
of attaining less than half asymptotic error probability for
larger parametric sets, specifically parameter sets that contain
$\Theta_{\mathrm{Sphere}}$, such as $\Theta_0$. These parameter sets
contain more unknowns (degrees of freedom), e.g., the covariance,
but have the same classification difficulty $Q(\alpha/2)$.

These results are consistent with previous results concerning
the asymptotic behavior of the so-called plug-in family of
classifiers where Maximum Likelihood (ML) estimates of
parameters are plugged into the MAP rule given in (1):

• Plug-in classification for $\Theta_0$: It has been shown that for
plug-in classifiers that use ML estimates of the means and
covariance, the error probability is asymptotically no better
than half.

• Sensing-aware classification: Here it is assumed that data
is generated according to $X = hW + Z$, where $h$ is the
"sensing subspace", $W$ is a class-dependent scalar latent
variable and $Z$ is white Gaussian noise independent of the
class labels and latent variables. Assuming Gaussian class
conditional densities for $W$, the class conditional density
of $X$ would be Gaussian with covariance
$\gamma^2 h h^T + \beta^2 I$. Also, the mean of $X$ would be in the
same direction $h$ for the two classes. In [5] it was shown that
the asymptotic probability of error of the ML projection
classifier, which is based on projecting data along the ML
estimate of $h$, is also no better than half. This can be
seen as a special case of Corollary 1, by considering
$\Theta_{\mathrm{Sensing\,Aware}} \supseteq \Theta_{\mathrm{Sphere}}$ where

$$\Theta_{\mathrm{Sensing\,Aware}}(\alpha) := \left\{(m_1 h,\ m_2 h,\ \gamma^2 h h^T + \beta^2 I) : \|h\| = 1,\ \gamma \geq 0,\ \beta > 0,\ |m_1 - m_2| = \alpha\sqrt{\gamma^2 + \beta^2}\right\}.$$

Fig. 1. Exponential sparsity class $\mathcal{H}_{\mathrm{exp}}$ (solid
curves, top figure) and polynomial sparsity class
$\mathcal{H}_{\mathrm{poly}}$ (solid curves, bottom figure) for d = 3.

Another important consequence of Theorem 1 is Corollary 2.
This result can be interpreted as implying that nontrivial
asymptotic minimax error probability is attainable only for
parametric subsets of zero Haar measure. This highlights
the necessity of some weak form of sparsity in the set of
feasible $h$ values in order to attain nontrivial asymptotic
error probability. This is consistent with previous results that
have shown that the supervised classification error can in fact
converge to the Bayes error when $h$ belongs to specific sparsity
classes [4], [5].

Specifically, in [5] it was shown that if the magnitudes of the
components of $h$ decay exponentially (or even polynomially)
when reordered according to decreasing values, then $h$ can
be estimated consistently (in the mean square sense) by
soft-thresholding the ML estimate of $h$, and a classifier based
on projecting data onto the estimate attains the Bayes error
asymptotically.

To represent these two sparsity classes, $\mathcal{H}_{\mathrm{exp}}$ and
$\mathcal{H}_{\mathrm{poly}}$ were defined in [5] as below:

$$\mathcal{H}_{\mathrm{exp}} = \left\{h : |h_{(k)}| = M_1(d)\, a^{k},\ 0 < a < 1\right\}$$
$$\mathcal{H}_{\mathrm{poly}} = \left\{h : |h_{(k)}| = M_2(d)\, k^{-p},\ p > 0.5\right\}$$

where $(h_{(1)}, \ldots, h_{(d)})$ are the components of $h$ in decreasing
order of magnitude. Since there is only one degree of freedom
in each set, the Haar measure of each of these two sets vanishes
when $d \geq 3$. Figure 1 illustrates these two sets (dark solid
curves) on the unit sphere in 3-dimensional space. These sets
satisfy the necessary conditions suggested by Corollary 2 and in
fact achieve the Bayes error asymptotically, which is better than
half. To summarize, it is essential to have strong structural
assumptions on the feasible set of parameters in order to
obtain nontrivial asymptotic classification performance in
high-dimensional small sample settings.
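
The two sparsity classes can be illustrated with a short sketch. It
assumes that the constants M_1(d) and M_2(d) simply normalize h to
unit length, which is an assumption made here for illustration. The
helper names, the generic soft-thresholding operator and the
threshold value passed to it are likewise hypothetical and are not
the specific estimator or tuning used in [5].

```python
import numpy as np

def exp_sparse_direction(d, a):
    """Unit vector with k-th largest magnitude proportional to a**k, 0 < a < 1."""
    k = np.arange(1, d + 1, dtype=float)
    h = a ** k
    return h / np.linalg.norm(h)   # normalization stands in for M_1(d)

def poly_sparse_direction(d, p):
    """Unit vector with k-th largest magnitude proportional to k**(-p), p > 0.5."""
    k = np.arange(1, d + 1, dtype=float)
    h = k ** (-p)
    return h / np.linalg.norm(h)   # normalization stands in for M_2(d)

def soft_threshold(v, t):
    """Generic soft-thresholding: shrink each entry toward zero by t."""
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def top_energy_fraction(h, m):
    """Fraction of ||h||^2 carried by the m largest-magnitude components."""
    mags = np.sort(np.abs(h))[::-1]
    return float(np.sum(mags[:m] ** 2) / np.sum(mags ** 2))

h_exp = exp_sparse_direction(d=1000, a=0.5)
h_poly = poly_sparse_direction(d=1000, p=1.0)
shrunk = soft_threshold([3.0, -1.0, 0.5], t=1.0)
```

For such directions almost all of the energy sits in a handful of
coordinates, which is the kind of structure that makes thresholding a
noisy estimate of h effective.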

V. Proofs 

A. Proof of Theorem 1 

Proof. The key proof idea is to randomize the selection of
$\theta \in \Theta_{\mathrm{Sphere}}$. The worst error probability over
$\theta \in \Theta_{\mathrm{Sphere}}$ is not smaller than the average
(expected) error probability when $\theta$ is random. For any fixed
value of $\theta$, the training and test samples are independent,
but if $\theta$ is random, they can become dependent through
$\theta$. Then the expected error probability (with respect to both
training data and $\theta$) of any supervised classification rule
$\hat{y}_{T_n}$ cannot be smaller than that of the
$\theta$-distribution-induced MAP rule based on $T_n$. If the
randomizing distribution for $\theta$ is carefully selected, then
the lower bound can be made to converge to 1/2 as $n$ and $d$
scale. We now work out the details of this proof idea.

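Let $S^{d-1}$ denote the unit $(d-1)$-sphere. For every
$h \in S^{d-1}$, $\|h\| = 1$ and
$(h, -h, \beta^2 I) \in \Theta_{\mathrm{Sphere}}$. Let
$\theta = (H, -H, \beta^2 I)$ where
$H \sim \mathrm{Uniform}(S^{d-1})$ is independent of the data labels
$Y_i$. Then, for every realization $H = h$,
$\theta \in \Theta_{\mathrm{Sphere}}$ and, conditioned on $H = h$
(equivalently, conditioned on a realization of $\theta$), the
training and test samples have the joint distribution that was
described earlier. By defining

$$y_{\mathrm{MAP}}(X_0) = \arg\max_{y_0 \in \{-1,+1\}} p_{Y_0|X_0,T_n}(y_0 \mid x_0, T_n, \Theta_{\mathrm{Sphere}}) \tag{3}$$

and using the notation
$P_e = P_e(n, d, \Theta_{\mathrm{Sphere}}, \hat{y}_{T_n})$, we have:

$$\begin{aligned}
P_e &\geq E_{\theta \in \Theta_{\mathrm{Sphere}}}\left[P_{e|\theta}(\hat{y}_{T_n})\right] \\
&= E_{\theta \in \Theta_{\mathrm{Sphere}}}\left[E_{T_n|\theta}\left[P_{e|T_n,\theta}(\hat{y}_{T_n})\right]\right] \\
&= E_{\theta \in \Theta_{\mathrm{Sphere}},\,T_n}\left[\Pr\left(\hat{y}_{T_n}(X_0) \neq Y_0 \mid T_n, \theta\right)\right] \\
&\overset{(i)}{\geq} E_{\theta \in \Theta_{\mathrm{Sphere}},\,T_n}\left[\Pr\left(y_{\mathrm{MAP}}(X_0) \neq Y_0 \mid T_n, \theta\right)\right]
\end{aligned} \tag{4}$$

where (i) holds because of the following argument. For any
measurable classification rule $\hat{y}_{T_n}$, we have

$$\Pr\left(\hat{y}_{T_n}(X_0) \neq Y_0 \mid T_n, \Theta_{\mathrm{Sphere}}\right) \geq \Pr\left(y_{\mathrm{MAP}}(X_0) \neq Y_0 \mid T_n, \Theta_{\mathrm{Sphere}}\right)$$

and hence

$$E_{\theta \in \Theta_{\mathrm{Sphere}}|T_n}\left[\Pr\left(\hat{y}_{T_n}(X_0) \neq Y_0 \mid \theta, T_n\right)\right] \geq E_{\theta \in \Theta_{\mathrm{Sphere}}|T_n}\left[\Pr\left(y_{\mathrm{MAP}}(X_0) \neq Y_0 \mid \theta, T_n\right)\right].$$

Finally, we obtain

$$E_{T_n}\left[E_{\theta \in \Theta_{\mathrm{Sphere}}|T_n}\left[\Pr\left(\hat{y}_{T_n}(X_0) \neq Y_0 \mid \theta, T_n\right)\right]\right] \geq E_{T_n}\left[E_{\theta \in \Theta_{\mathrm{Sphere}}|T_n}\left[\Pr\left(y_{\mathrm{MAP}}(X_0) \neq Y_0 \mid \theta, T_n\right)\right]\right]. \tag{5}$$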

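Using the notation $\hat{y}_{\mathrm{MAP}} = y_{\mathrm{MAP}}(X_0)$, we have:

$$\begin{aligned}
\hat{y}_{\mathrm{MAP}} &= \arg\max_{y_0 \in \{-1,+1\}} p_{Y_0|X_0,T_n}(y_0 \mid x_0, T_n, \Theta_{\mathrm{Sphere}}) \\
&= \arg\max_{y_0 \in \{-1,+1\}} p_{X_0,Y_0,T_n}(x_0, y_0, T_n \mid \Theta_{\mathrm{Sphere}}) \\
&= \arg\max_{y_0 \in \{-1,+1\}} E_{\theta \in \Theta_{\mathrm{Sphere}}}\left[p_{X_0,Y_0,T_n|\theta}(x_0, y_0, T_n \mid \theta)\right] \\
&\overset{(i)}{=} \arg\max_{y_0 \in \{-1,+1\}} E_{\theta \in \Theta_{\mathrm{Sphere}}}\left[\prod_{i=0}^{n} p_{X_i,Y_i|\theta}(x_i, y_i \mid \theta)\right] \\
&\overset{(ii)}{=} \arg\max_{y_0 \in \{-1,+1\}} E_{\theta \in \Theta_{\mathrm{Sphere}}}\left[\prod_{i=0}^{n} p_{X_i|Y_i,\theta}(x_i \mid y_i, \theta)\right] \\
&\overset{(iii)}{=} \arg\max_{y_0 \in \{-1,+1\}} E_H\left[\exp\left\{-\frac{1}{2\beta^2} \sum_{i=0}^{n} \left(\|x_i\|^2 - 2 y_i x_i^T H + \|y_i H\|^2\right)\right\}\right] \\
&\overset{(iv)}{=} \arg\max_{y_0 \in \{-1,+1\}} E_H\left[\exp\left\{\frac{1}{\beta^2} \left(\sum_{i=0}^{n} y_i x_i\right)^T H\right\}\right] \\
&\overset{(v)}{=} \arg\max_{y_0 \in \{-1,+1\}} \frac{1}{\beta^2} \left\|\sum_{i=0}^{n} y_i x_i\right\| \\
&= \mathrm{sign}\left(x_0^T \left(\sum_{i=1}^{n} y_i x_i\right)\right).
\end{aligned}$$

In the above derivation, (i) is because the training and
test samples are conditionally independent given $\theta$, (ii) is
because $p_{Y_i}(+1) = p_{Y_i}(-1) = 0.5$ for all $i$, (iii) is because
$\theta = (H, -H, \beta^2 I) = (\mu_{+1}, \mu_{-1}, \Sigma)$ (factors
that do not depend on $y_0$ are dropped), (iv) is because
$\|H\| = 1$ and $y_i = \pm 1$ for all $i$, and (v) is due to the
following result. The last equality holds because
$\|y_0 x_0 + \sum_{i=1}^{n} y_i x_i\|^2$ depends on $y_0$ only through
the cross term $2 y_0\, x_0^T \sum_{i=1}^{n} y_i x_i$.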

Lemma 1. Let $H$ be uniformly distributed on $S^{d-1}$ and
$f(x) := E_H\left[\exp\{H^T x\}\right]$. Then $f(x)$ is a radial
function that is convex and nondecreasing in $\|x\|$.

Proof. Since $H$ is uniformly distributed on $S^{d-1}$, $f(x)$ is a
radial function. Consider $x = tu$ where $t \in \mathbb{R}$ and
$\|u\| = 1$. If $g(t) := f(tu)$ then $g(0) = 1$ and $g$ is symmetric
since the distribution of $H$ is spherically symmetric. We also have
$g'(t) = E_H\left[(H^T u) \exp\{t H^T u\}\right]$ so that $g'(0) = 0$,
again because the distribution of $H$ is spherically symmetric.
Finally, $g''(t) = E_H\left[(H^T u)^2 \exp\{t H^T u\}\right] \geq 0$,
which shows that $g(t)$ is convex for $t \geq 0$. Since $g(t)$ is
convex for $t \geq 0$ and $g'(0) = 0$, it is nondecreasing for
$t \geq 0$. □
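
As a concrete check of Lemma 1, the case d = 3 admits a closed form.
The computation below is an illustration added for the reader and is
not part of the original argument; it relies on the classical fact
that a single coordinate $s = H^T u$ of a uniform point on the unit
2-sphere is uniformly distributed on $[-1, 1]$.

```latex
g(t) = f(tu) = E_H\left[\exp\{t H^T u\}\right]
     = \frac{1}{2}\int_{-1}^{1} e^{ts}\, ds
     = \frac{\sinh t}{t}
     = \sum_{k=0}^{\infty} \frac{t^{2k}}{(2k+1)!},
\qquad g(0) = 1 .
```

The series contains only even powers of $t$ with nonnegative
coefficients, so in this case $g$ is symmetric, convex, and
nondecreasing for $t \geq 0$, in agreement with the lemma.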

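Continuing the proof, we have:

$$\begin{aligned}
P_e &\overset{(i)}{\geq} E_{\theta,\,T_n}\left[\Pr\left(y_{\mathrm{MAP}}(X_0) \neq Y_0 \mid T_n, \theta\right)\right] \\
&= E_{H,\,T_n}\left[\Pr\left(\mathrm{sign}\left(X_0^T \sum_{i=1}^{n} Y_i X_i\right) \neq Y_0 \,\Big|\, T_n, H\right)\right] \\
&= E_{H,\,T_n}\left[Q\left(\frac{H^T \left(\sum_{i=1}^{n} Y_i X_i\right)}{\beta \left\|\sum_{i=1}^{n} Y_i X_i\right\|}\right)\right] \\
&= E\left[Q\left(\frac{1 + H^T V}{\beta \sqrt{1 + 2 H^T V + \|V\|^2}}\right)\right]
\end{aligned} \tag{6}$$

where $V \sim \mathcal{N}(0, \frac{\beta^2}{n} I_d)$ and is independent
of $H$. This follows as we can write $X_i$ as $X_i = Y_i H + Z_i$,
where the $Z_i$ are white Gaussian noise vectors with variance
$\beta^2$, which are independent of $H$ and the labels $Y_i$. Hence
$\frac{1}{n}\sum_{i=1}^{n} Y_i X_i = H + \frac{1}{n}\sum_{i=1}^{n} Y_i Z_i$.
Finally, by taking $V = \frac{1}{n}\sum_{i=1}^{n} Y_i Z_i$ it follows
that $V \sim \mathcal{N}(0, \frac{\beta^2}{n} I_d)$. Note that (i) is
proved in equation (4).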
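
The decision rule inside the probability in (6), namely the sign of
the inner product between the test sample and the label-weighted sum
of the training samples, is simple to state in code. The sketch
below assumes NumPy; the function name map_rule and the convention of
mapping a zero score to +1 are illustrative assumptions.

```python
import numpy as np

def map_rule(x0, x_train, y_train):
    """Return sign(x0^T sum_i y_i x_i) for training rows x_i with labels y_i."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    s = y_train @ x_train                      # sum_i y_i x_i, shape (d,)
    score = float(np.asarray(x0, dtype=float) @ s)
    return 1 if score >= 0 else -1             # zero score mapped to +1 (arbitrary)

# Tiny example in d = 2: both training points vote for direction (1, 0).
x_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_train = np.array([1, -1])
label_a = map_rule([0.5, 3.0], x_train, y_train)
label_b = map_rule([-0.2, 5.0], x_train, y_train)
```

Here the label-weighted sum is (2, 0), so only the first coordinate of
the test point matters: the first test point is labeled +1 and the
second is labeled -1.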



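Lemma 2. Let

$$W := \frac{1 + H^T V}{\sqrt{1 + 2 H^T V + \|V\|^2}}.$$

Then $W \xrightarrow{p} 0$ as $d \to \infty$ and $n/d \to 0$.

Proof. Since $E(H^T V) = E(H^T) E(V) = 0$, we have

$$\mathrm{Var}(H^T V) = E(H^T V V^T H) = E_H E_{V|H}(H^T V V^T H) \overset{(i)}{=} E_H\left(H^T E_V(V V^T) H\right) \overset{(ii)}{=} E_H\left(\frac{\beta^2}{n} H^T H\right) = \frac{\beta^2}{n}$$

where both (i) and (ii) follow because of the independence of
$H$ and $V$. Therefore, $\lim_{d \to \infty} \mathrm{Var}(H^T V) = 0$,
since $n \to \infty$ as well. As a result,
$1 + H^T V \xrightarrow{p} 1$.

Next, we will show that
$\mathrm{Var}(1 + 2 H^T V + \|V\|^2) = O\left(\frac{d}{n^2}\right)$.
First, observe that
$E(1 + 2 H^T V + \|V\|^2) = 1 + \beta^2 \frac{d}{n}$. Thus

$$\mathrm{Var}(1 + 2 H^T V + \|V\|^2) = E\left((1 + 2 H^T V + \|V\|^2)^2\right) - \left(1 + \beta^2 \frac{d}{n}\right)^2.$$

Furthermore, we have

$$E\left((1 + 2 H^T V + \|V\|^2)^2\right) = \underbrace{4 E(H^T V V^T H)}_{4\beta^2/n} + E(\|V\|^4) + \underbrace{4 E(H^T V)}_{0} + \underbrace{2 E(\|V\|^2)}_{2\beta^2 d/n} + \underbrace{4 E(H^T V \|V\|^2)}_{0} + 1.$$

It remains to calculate $E(\|V\|^4)$. Note that
$E(\|V\|^4) = \mathrm{Var}(V^T V) + \beta^4 \frac{d^2}{n^2}$. But we know
that for a Gaussian random vector $e \sim \mathcal{N}(\mu, \Sigma)$ and
an arbitrary symmetric matrix $A$, we have
$\mathrm{Var}(e^T A e) = 2\,\mathrm{tr}(A \Sigma A \Sigma) + 4 \mu^T A \Sigma A \mu$.
Therefore,
$E(\|V\|^4) = 2\beta^4 \frac{d}{n^2} + \beta^4 \frac{d^2}{n^2}$.
Finally, we have

$$\mathrm{Var}(1 + 2 H^T V + \|V\|^2) = \frac{4\beta^2}{n} + \frac{2\beta^4 d}{n^2} = O\left(\frac{d}{n^2}\right).$$

As a result,
$\frac{n}{d}(1 + 2 H^T V + \|V\|^2) \xrightarrow{p} \beta^2$, because
it converges in $L^2$: its mean $\frac{n}{d} + \beta^2$ tends to
$\beta^2$ and its variance is $O(1/d)$. We have

$$W = \sqrt{\frac{n}{d}} \cdot \frac{1 + H^T V}{\sqrt{\frac{n}{d}\left(1 + 2 H^T V + \|V\|^2\right)}}.$$

Note that the numerator and denominator of the fraction go to 1
and $\beta$ in probability, respectively. Therefore, using Slutsky's
Theorem, the whole fraction goes to $1/\beta$ in probability. But
$\sqrt{n/d}$ goes to zero, therefore $W \xrightarrow{p} 0$. □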
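
The first two moments used in the proof of Lemma 2 can be
sanity-checked by simulation. The sketch below is a rough Monte Carlo
check, assuming NumPy; the function names and the particular values
of n, d, beta and the number of trials are arbitrary illustrative
choices.

```python
import numpy as np

def moment_formulas(n, d, beta):
    """Closed-form mean and variance of T = 1 + 2 H^T V + ||V||^2."""
    mean = 1.0 + beta**2 * d / n
    var = 4.0 * beta**2 / n + 2.0 * beta**4 * d / n**2
    return mean, var

def simulate_moments(n, d, beta, trials, seed=0):
    """Monte Carlo estimate of the mean and variance of T."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, d))
    h = g / np.linalg.norm(g, axis=1, keepdims=True)             # H ~ Uniform(S^{d-1})
    v = (beta / np.sqrt(n)) * rng.standard_normal((trials, d))   # V ~ N(0, beta^2/n I_d)
    t = 1.0 + 2.0 * np.sum(h * v, axis=1) + np.sum(v * v, axis=1)
    return float(t.mean()), float(t.var(ddof=1))

mean_formula, var_formula = moment_formulas(n=50, d=500, beta=1.0)
mean_mc, var_mc = simulate_moments(n=50, d=500, beta=1.0, trials=4000)
```

With these values the formulas give a mean of 11 and a variance of
0.48, and the simulated estimates should agree up to Monte Carlo
error.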

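Using Slutsky's theorem, $Q(W/\beta) \xrightarrow{p} Q(0) = \frac{1}{2}$.
Since $0 \leq Q(\cdot) \leq 1$, using the Dominated Convergence
Theorem,

$$\lim_{(d,\,n/d) \to (\infty,0)} E\left[Q(W/\beta)\right] = \frac{1}{2}.$$

Taking the limit inferior of both sides of (6), we finally conclude
that for any $\hat{y}_{T_n}$,

$$\liminf_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta_{\mathrm{Sphere}}, \hat{y}_{T_n}) \geq \frac{1}{2}.$$

□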
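
The lower bound derived in this proof can be evaluated numerically to
see how it approaches 1/2. The sketch below estimates the expectation
of Q((1 + H^T V) / (beta sqrt(1 + 2 H^T V + ||V||^2))) by Monte Carlo.
It assumes NumPy and uses the rotational invariance of V to fix H as
the first coordinate vector, so that H^T V is a scalar Gaussian and
the remaining energy of V is a scaled chi-square variable. The
function names and parameter values are illustrative assumptions.

```python
import math
import numpy as np

def q_function(t):
    """Gaussian tail probability Q(t) = P(N(0, 1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def map_rule_error(n, d, alpha, trials=2000, seed=0):
    """Monte Carlo estimate of the averaged MAP-rule error (requires d >= 2)."""
    rng = np.random.default_rng(seed)
    beta = 2.0 / alpha
    # H^T V with H fixed to e_1: a scalar N(0, beta^2 / n).
    v1 = (beta / math.sqrt(n)) * rng.standard_normal(trials)
    # Energy of V orthogonal to H: (beta^2 / n) times chi-square with d - 1 dof.
    rest = (beta**2 / n) * rng.chisquare(d - 1, size=trials)
    v_norm_sq = v1**2 + rest
    w = (1.0 + v1) / np.sqrt(1.0 + 2.0 * v1 + v_norm_sq)
    return float(np.mean([q_function(t) for t in w / beta]))

bayes = q_function(4.0 / 2.0)                       # Bayes error for alpha = 4
err_small_d = map_rule_error(n=20, d=20, alpha=4.0)
err_large_d = map_rule_error(n=20, d=200000, alpha=4.0)
```

With the Bayes error held fixed, the estimate stays well below 1/2
when d is comparable to n and moves close to 1/2 once d is several
orders of magnitude larger than n.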

B. Proof of Corollary 1 

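Proof. For any classifier $\hat{y}_{T_n}$, we have:

$$P_e(n, d, \Theta, \hat{y}_{T_n}) = \sup_{\theta \in \Theta} P_{e|\theta}(\hat{y}_{T_n}) \geq \sup_{\theta \in \Theta_{\mathrm{Sphere}}} P_{e|\theta}(\hat{y}_{T_n}) = P_e(n, d, \Theta_{\mathrm{Sphere}}, \hat{y}_{T_n})$$

because $\Theta_{\mathrm{Sphere}} \subseteq \Theta$. By taking the limit
inferior of both sides and applying Theorem 1, the corollary is
proved. □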

C. Proof of Corollary 2 

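Proof. Let
$\mathrm{vol}(\mathcal{H}) = \Pr_{H \sim U(S^{d-1})}(H \in \mathcal{H})$,
where $U(S^{d-1})$ denotes the uniform distribution over the
unit $(d-1)$-sphere in $d$-dimensional space, and let
$\mathcal{H}^c := S^{d-1} \setminus \mathcal{H}$. Suppose that
$\limsup_{(d,\,n/d) \to (\infty,0)} P_e(n, d, \Theta_{\mathrm{subset}}, \hat{y}_{T_n}) < \frac{1}{2}$.
Hence
$\limsup_{(d,\,n/d) \to (\infty,0)} E_{H \sim U(\mathcal{H})}\left(P_{e|\theta}(\hat{y}_{T_n})\right) < \frac{1}{2}$.
For the specified sequence of classifiers $\hat{y}_{T_n}$, which
satisfies the conditions of the corollary, we have

$$\begin{aligned}
&\limsup_{(d,\,n/d) \to (\infty,0)} E_{H \sim U(S^{d-1})}\left(P_{e|\theta}(\hat{y}_{T_n})\right) \\
&\quad = \limsup_{(d,\,n/d) \to (\infty,0)} \left[\mathrm{vol}(\mathcal{H})\, E_{H \sim U(\mathcal{H})}\left(P_{e|\theta}(\hat{y}_{T_n})\right) + \mathrm{vol}(\mathcal{H}^c)\, E_{H \sim U(\mathcal{H}^c)}\left(P_{e|\theta}(\hat{y}_{T_n})\right)\right] \\
&\quad \overset{(i)}{\leq} \limsup \mathrm{vol}(\mathcal{H}) \cdot \limsup E_{H \sim U(\mathcal{H})}\left(P_{e|\theta}(\hat{y}_{T_n})\right) + \limsup \mathrm{vol}(\mathcal{H}^c) \cdot \limsup E_{H \sim U(\mathcal{H}^c)}\left(P_{e|\theta}(\hat{y}_{T_n})\right) \\
&\quad \overset{(ii)}{<} \frac{1}{2}\left(\lim_{(d,\,n/d) \to (\infty,0)} \mathrm{vol}(\mathcal{H}) + \lim_{(d,\,n/d) \to (\infty,0)} \mathrm{vol}(\mathcal{H}^c)\right) = \frac{1}{2},
\end{aligned}$$

which is a contradiction, since the proof of Theorem 1 shows that
the limit inferior of
$E_{H \sim U(S^{d-1})}\left(P_{e|\theta}(\hat{y}_{T_n})\right)$ is at least
$\frac{1}{2}$. In the previous equations, (i) follows by two
properties of the limit superior, namely
$\limsup_{n \to \infty} (f_n + g_n) \leq \limsup_{n \to \infty} f_n + \limsup_{n \to \infty} g_n$
and
$\limsup_{n \to \infty} f_n g_n \leq \limsup_{n \to \infty} f_n \cdot \limsup_{n \to \infty} g_n$
for bounded and positive sequences $f_n$ and $g_n$. Moreover, (ii)
follows because the error probability over
$\Theta_{\mathrm{Sphere}}$ is asymptotically at most $\frac{1}{2}$ under
the conditions of the corollary, and because it is assumed that the
limit of $\mathrm{vol}(\mathcal{H})$ is nonzero. □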

VI. Concluding Remarks 

That prior knowledge such as sparsity improves inference in
high-dimensional, small-sample settings is folklore. The results
presented here show that such knowledge is in fact
indispensable: without it, the asymptotic performance
degenerates to a random coin toss. Our analysis
focused on supervised binary classification with Gaussian class
conditional densities and equally likely classes. One could
expect similar conclusions to hold in more complex inference
problems. However, the proof techniques used in this work may
not generalize to more complex and non-Gaussian settings.

Acknowledgment 

This article is based upon work supported by the U.S. 
AFOSR and U.S. NSF under award numbers #FA9550-10- 
1-0458 (subaward #A1795) and #1218992 respectively. The 
views and conclusions contained in this article are those 
of the authors and should not be interpreted as necessarily 
representing the official policies, either expressed or implied, 
of the U.S. AFOSR or U.S. NSF. 

References 

[1] D. Donoho and J. Jin, "Higher criticism for detecting sparse heteroge- 
neous mixtures," Annals of Statistics, vol. 32, no. 3, pp. 962-994, 2004. 

[2] A. Singh, R. D. Nowak, and A. R. Calderbank, "Detecting weak 
but hierarchically-structured patterns in networks," in International 
Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 
749-756. 

[3] P. J. Bickel and E. Levina, "Some theory for Fisher's linear discriminant
function, 'naive Bayes', and some alternatives when there are many more
variables than observations," Bernoulli, vol. 10, no. 6, pp. 989-1010,
2004.

[4] J. Shao, Y. Wang, X. Deng, and S. Wang, "Sparse linear discriminant
analysis by thresholding for high dimensional data," Annals of Statistics,
vol. 39, no. 2, pp. 1241-1265, 2011.

[5] B. Orten, P. Ishwar, W. C. Karl, and V. Saligrama, "Sensing structure 
in learning-based binary classification of high-dimensional data: Oppor- 
tunities and perils," in Annual Allerton Conference on Communication, 
Control and Computing, 2011.



